AMENDMENTS TO THE CLAIMS

The Assignee submits below a complete listing of the current claims, including marked-up 
claims with insertions indicated by underlining and deletions indicated by strikeouts and/or double 
bracketing. This listing of claims replaces all prior versions and listings of claims in the application: 

1. (Currently amended) A program storage device readable by a machine, tangibly
embodying a program of instructions executable by the machine to perform a method for speech
synthesis that allows user specified pronunciations, the method comprising: 

extracting prosodic parameter[[s]] values from an audio signal corresponding to a 
pronunciation of a text string by a user; 

extracting duration parameter[[s]] values from the audio signal by aligning the audio signal 
with the text string;

[[automatically generating at least one text to speech (TTS) input using the prosodic
parameters and the duration parameters, wherein the at least one TTS input is formatted for use in
synthesizing speech from the text string; and]]

adopting as synthesis parameter values the prosodic parameter values and duration 
parameter values extracted from the audio signal; and

generating a synthetic speech waveform using the [[at least one TTS input]] synthesis parameter
values.

2-3. (Canceled) 

4. (Previously Presented) The program storage device of claim 1, wherein the instructions for 
aligning comprise instructions for segmenting the audio signal into time-segmented regions, 
wherein each time-segmented region is mapped to a corresponding phoneme. 

5. (Previously Presented) The program storage device of claim 1, wherein the alignment is
performed using a Viterbi alignment process. 

6. (Canceled)

7. (Currently amended) The program storage device of claim 1, wherein the instructions for
automatically generating [[at least one TTS input]] the synthetic speech waveform comprise
instructions for directly specifying at least one portion of the [[duration parameters and/or the
prosodic]] synthesis parameter[[s]] values as attribute values for mark-up elements.

8. (Currently amended) The program storage device of claim 1, wherein the instructions for
automatically generating [[at least one TTS input]] the synthetic speech waveform comprise
instructions for assigning abstract labels to at least one portion of the [[duration parameters and/or the
prosodic]] synthesis parameter[[s]] values to generate a high-level markup of the text string.

9. (Currently amended) The program storage device of claim 1, wherein [[the at least one TTS
input is generated]] generating comprises generating a markup of the text string using SSML (speech
synthesis markup language).

10. (Previously Presented) The program storage device of claim 1, further comprising
instructions for processing phonetic content of the audio signal to generate the synthetic speech 
waveform having a desired pronunciation. 

11. (Currently amended) A method for speech synthesis that allows user specified
pronunciations, the method comprising: 

extracting prosodic parameter[[s]] values from an audio signal corresponding to a
pronunciation of a text string by a user; 

extracting duration parameter[[s]] values from the audio signal by aligning the audio signal 
with the text string; 

[[automatically generating at least one text to speech (TTS) input using the prosodic
parameters and the duration parameters, wherein the at least one TTS input is formatted for use in
synthesizing speech from the text string; and]]

adopting as synthesis parameter values the prosodic parameter values and duration 
parameter values extracted from the audio signal; and

generating a synthetic speech waveform using the [[at least one TTS input]] synthesis parameter
values.

12-13. (Canceled) 

14. (Previously Presented) The method of claim 11, wherein aligning comprises extracting
acoustic feature data from the audio signal and time-aligning the audio signal to the text string 
using the acoustic feature data. 

15. (Previously Presented) The method of claim 11, wherein aligning is performed using a
Viterbi alignment process. 

16. (Canceled) 

17. (Currently amended) The method of claim 11, wherein automatically generating [[at least one
TTS input]] the synthetic speech waveform comprises directly specifying at least one portion of the
[[duration parameters and/or the prosodic]] synthesis parameter[[s]] values as attribute values for
mark-up elements.

18. (Currently amended) The method of claim 11, wherein automatically generating [[at least one
TTS input]] the synthetic speech waveform comprises assigning abstract labels to at least one portion
of the [[duration parameters and/or the prosodic]] synthesis parameter[[s]] values to generate a
high-level markup of the text string.

19. (Currently amended) The method of claim 11, wherein [[the at least one TTS input is
generated]] generating comprises generating a markup of the text string using SSML (speech
synthesis markup language).

20. (Previously Presented) The method of claim 11, further comprising processing phonetic
content of the audio signal to generate the synthetic speech waveform having a desired 
pronunciation. 

21. (Currently amended) A text-to-speech (TTS) system that allows user specified
pronunciations, comprising: 

a prosody analyzer for determining prosodic synthesis parameter[[s]] [[of]] values from an
audio signal corresponding to a pronunciation by a user of an input text string [[and automatically
generating at least one TTS system input corresponding to the audio signal using the prosodic
parameters]], wherein the prosody analyzer comprises:

a prosodic parameter extraction module for extracting [[the]] prosodic
parameter[[s]] values from the audio signal,

an alignment module for extracting duration parameter[[s]] values from the 
audio signal by aligning the input text string with the audio signal, and 

a conversion module for [[generating the at least one TTS system input using
the prosodic parameters and the duration parameters]] adopting as synthesis parameter
values the prosodic parameter values and duration parameter values extracted from
the audio signal; and

a TTS system engine for generating a synthetic speech waveform using the [[at least one TTS
system input]] synthesis parameter values.

22. (Previously Presented) The system of claim 21, further comprising a user interface that
enables a user to input the audio signal and the input text string corresponding to the audio signal. 

23. (Previously Presented) The system of claim 21, wherein the prosody analyzer processes
phonetic content of the audio signal to generate the synthetic waveform having a desired 
pronunciation. 

24-28. (Canceled) 

29. (Previously Presented) The program storage device of claim 33, wherein extracting acoustic 
feature data from the audio signal comprises digitizing the audio signal into a set of frames and 
transforming the digitized audio signal into a set of feature vectors on a frame-by-frame basis. 

30. (Previously Presented) The program storage device of claim 29, wherein transforming the 
digitized audio signal comprises producing a 24-dimensional cepstra feature vector for every 10ms 
of the audio signal, concatenating frames to the left and to the right of a current frame to augment a 
current cepstral vector, and reducing each augmented cepstral vector to a 60-dimensional feature 
vector using linear discriminant analysis. 

31. (Previously Presented) The method of claim 34, wherein extracting acoustic feature data
from the audio signal comprises digitizing the audio signal into a set of frames and transforming the 
digitized audio signal into a set of feature vectors on a frame-by-frame basis. 

32. (Previously Presented) The method of claim 31, wherein transforming the digitized audio
signal comprises producing a 24-dimensional cepstra feature vector for every 10ms of the audio 
signal, concatenating frames to the left and to the right of a current frame to augment a current 
cepstral vector, and reducing each augmented cepstral vector to a 60-dimensional feature vector 
using linear discriminant analysis.

33. (Previously Presented) The program storage device of claim 1 wherein the method further
comprises extracting acoustic feature data from the audio signal and wherein the aligning further
comprises outputting a set of duration contours. 

34. (Previously Presented) The method of claim 11 further comprising extracting acoustic
feature data from the audio signal and wherein the aligning further comprises outputting a set of 
duration contours. 

35. (Previously Presented) The system of claim 21 wherein the prosody analyzer further
comprises an acoustic feature extraction module that extracts acoustic feature data from the audio
signal and wherein the alignment module uses said acoustic feature data to perform the aligning. 